Genome-wide analysis highlights genetic admixture in exotic germplasm resources of Eucalyptus and unexpected ancestral genomic composition of interspecific hybrids

Eucalyptus is an economically important genus comprising more than 890 species in different subgenera and sections. Approximately twenty species of subgenus Symphyomyrtus account for 95% of the world’s planted eucalypts. Discrimination of closely related eucalypt taxa is challenging, consistent with their recent phylogenetic divergence and occasional hybridization in nature. Admixture, misclassification or mislabeling of Eucalyptus germplasm resources maintained as exotics have been suggested, although no reports are available. Moreover, hybrids with increased productivity and trait complementarity are planted worldwide, but little is known about their actual genomic ancestry. In this study we examined a set of 440 trees of 16 different Eucalyptus species and 44 interspecific hybrids of multi-species origin conserved in germplasm banks in Brazil. We used genome-wide SNP data to evaluate the agreement between the alleged phylogenetic classification of species and provenances as registered in their historical records, and their observed genetic clustering derived from SNP data. Genetic structure analyses correctly assigned each of the 16 species to a different cluster, although the PCA positioning of E. longirostrata was inconsistent with its current taxonomy. Admixture was present for closely related species’ materials derived from local germplasm banks, indicating unintended hybridization following germplasm introduction. Provenances could be discriminated for some species, with SNP-based discrimination depending strongly on geographical distance, consistent with an isolation-by-distance model. SNP-based genomic ancestry analysis showed that the majority of the hybrids displayed realized genomic compositions deviating from those expected based on their pedigree records, consistent with admixture in their parents and pervasive genome-wide directional selection toward the fast-growing E. grandis genome.
SNP data in support of tree breeding provide precise germplasm identity verification, and allow breeders to objectively recognize the actual ancestral origin of superior hybrids to more realistically guide the program toward the development of the desired genetic combinations.


Introduction
This DNA assay has been valuable as it provides simultaneous discovery and genotyping of Single Nucleotide Polymorphisms (SNP) within and across species, facilitating genus-wide phylogenetic studies. However, some challenges remain for this SNP genotyping method due to variable sequencing coverage and irregular sampling of loci causing variable genotype reproducibility and ultimately limited data portability across studies in highly heterozygous genomes such as those of the eucalypts [28][29][30]. The development of Eucalyptus multispecies SNP arrays based on industry-level "gold standard" technology has provided a worldwide usable platform allowing seamless and precise data exchange across studies [31].
Although species discrimination using DNA data is largely settled, less attention has been devoted to provenance variation within species. This is particularly important for breeding programs that take advantage of matching distinctive provenance characteristics to specific sites in exotic environments, or aim at deliberately exploiting provenance and species complementarity by building specific genomic compositions by interspecific hybridization [2,12,18]. Likewise, few studies have examined the possibility of using DNA data to describe the actual genomic ancestral composition of hybrids, including those derived from more than two parental species. Knowledge of the actual genomic composition of complex hybrids of distinctive performance would allow more deliberate selection strategies in hybrid breeding programs. Earlier studies using microsatellite markers indicated that provenances of E. grandis could be distinguished but those of E. urophylla and E. camaldulensis could not, and that some hybrid clones could be assigned to their most likely ancestral species, although with incomplete resolution [32]. Using SNP data, preliminary analyses have shown that provenances within species could be distinguished for E. grandis and E. urophylla [31,33] but not for E. camaldulensis, consistent with the latter being more prone to hybridization or a remnant of an ancient widespread taxon [8].
The current eucalypt SNP arrays have been used to estimate recombination rates and carry out dense linkage mapping [34], build relationship matrices for genomic selection in several species (reviewed in [35]), and understand the consequences of artificial selection [36]. No studies to date, however, have evaluated their ability to characterize germplasm material in gene banks. Questions frequently arise regarding the verification of the alleged species classification, the possibility of discriminating provenances, and the determination of the genomic composition of hybrid clones of unknown or uncertain origin derived from successive generations of interspecific recombination. In this study we examined a large set of germplasm accessions including 440 Eucalyptus trees of 16 species and 44 interspecific hybrids currently conserved or used in Brazil. We used genome-wide SNP data to evaluate the agreement between the alleged phylogenetic classification of species and provenances as registered in their historical records, and their observed genetic clustering obtained from genomic data, agnostic to any prior phylogenetic information. We focused on the main planted species of Symphyomyrtus given their outstanding relevance in terms of germplasm use and conservation. Additionally, we used SNP data to examine the actual genomic makeups of hybrids derived from interspecific crosses involving two or more species, and compared them with their expected composition based on the recorded ancestral species.

Plant material
The study involved a germplasm set of 440 trees belonging to 16 Eucalyptus species of five sections of subgenus Symphyomyrtus and 44 interspecific hybrid clones (Table 1). These trees are conserved in species/provenance/progeny trials and clonal banks at the Anhembi Experimental Research Station of the Institute for Forestry Research (IPEF) in Brazil (22.7897˚S, 48.1280˚W), or in the gene banks of some associated forest-based companies. For six species (E. grandis, E. longirostrata, E. pellita, E. robusta, E. saligna and E. tereticornis), samples were analyzed for more than one provenance. The original locations of the species and provenances were plotted on a base map of Australia taken from a world country boundaries shapefile, publicly available under a Creative Commons Attribution 4.0 International Public License (https://datacatalog.worldbank.org/search/dataset/0038272/World-Bank-Official-Boundaries), using the R package tmap (Fig 1). Plant material included: (1) individual trees sampled in species/provenance trials established with original seeds collected in Australia for which the CSIRO (Commonwealth Scientific and Industrial Research Organisation) seedlot number is known, and (2) individual trees collected in Brazilian germplasm banks at least one generation removed from the original introductions, maintained by IPEF or by associated forestry companies (Suzano, Klabin, Vallourec and CMPC Celulose Riograndense), sometimes with unknown provenance origin (Table 1). In addition, 44 interspecific hybrid clones obtained by controlled interspecific crosses of two or more species were studied to compare their SNP-realized versus pedigree-expected genomic composition. These hybrids were grouped into five

SNP genotyping and filtering
Total genomic DNA was extracted with an optimized Sorbitol/CTAB protocol [37]. DNA samples were sent to ThermoFisher (Santa Clara, CA) for SNP genotyping with the 72K Eucalyptus Axiom Array developed for Eucalyptus and Corymbia species (https://www.thermofisher.com/order/catalog/product/551134; Grattapaglia D. and Silva-Junior O.B., unpublished). This Axiom Array is a second-generation Eucalyptus SNP platform with 68,055 SNPs specific to the Eucalyptus genome, 28,177 of them shared with the previously developed Infinium EUChip60K [31], and 4,147 specific to the genome of its sister genus Corymbia; the latter were not used in this study. SNPs with more than 10% missing data or with minor allele frequency (MAF) below 5% were removed with PLINK v1.9 [38] using parameters --maf 0.05 --geno 0.1. A total of 48,645 SNPs passed these filtering thresholds. The dataset was further pruned of 21,347 SNPs that were in linkage disequilibrium (LD) with other markers, to remove redundant information and avoid regions of the genome with a disproportionate influence on the results that could distort the representation of genome-wide structure [39]. LD pruning was performed using PLINK parameter --indep-pairwise 50
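The missing-data and MAF thresholds described in this section can be illustrated with a minimal Python sketch. It is not the PLINK implementation used in the study; the genotype coding (0, 1 or 2 copies of one allele, None for a missing call), the function names and the toy matrix are assumptions made only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical coding: 0, 1 or 2 copies of one allele; None = missing call.
Genotype = Optional[int]


def snp_stats(calls: List[Genotype]) -> Tuple[float, float]:
    """Return (missing_rate, maf) for one SNP across all samples."""
    called = [g for g in calls if g is not None]
    missing_rate = 1.0 - len(called) / len(calls)
    if not called:
        return missing_rate, 0.0
    allele_freq = sum(called) / (2 * len(called))
    return missing_rate, min(allele_freq, 1.0 - allele_freq)


def filter_snps(matrix: List[List[Genotype]],
                max_missing: float = 0.10,
                min_maf: float = 0.05) -> List[int]:
    """Return the row indices of SNPs that pass both thresholds."""
    kept = []
    for index, calls in enumerate(matrix):
        missing_rate, maf = snp_stats(calls)
        if missing_rate <= max_missing and maf >= min_maf:
            kept.append(index)
    return kept


# Toy matrix (invented values): rows are SNPs, columns are trees.
toy_matrix = [
    [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],        # polymorphic, fully called
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],        # monomorphic, MAF = 0
    [0, 1, None, None, 2, 1, 0, 1, 1, 0],  # 20% missing calls
]
print(filter_snps(toy_matrix))
```

Only the first toy SNP is retained: the second fails the MAF threshold and the third exceeds the 10% missing-data limit.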

Statistical and population genetics analyses
Basic population genetics parameters were estimated, namely the average minor allele frequency (MAF) and the observed (Ho) and expected (He) heterozygosity. Analyses were performed in R (R Core Team 2020) using packages adegenet v2.1.3 [40] and hierfstat v0.5-7 [41]. The data were input into R in FSTAT format after transformation with PGDSpider v2.1.1.5 [42]. fastSTRUCTURE v1.0 [43] was run with the 27,298 SNPs to infer population structure for the 484 individual trees. Analyses were performed with the number of clusters K varying from 2 to 30 and option --seed=100. The input was the binary version (BED) of the PED file from PLINK. The most likely model was selected using the supervised estimators of [44] implemented in the StructureSelector [45] web server (https://lmme.ac.cn/StructureSelector/). Cluster assignment for each of the samples was visualized with barplots in R, using packages pophelper v2.3.0 [46] and gridExtra v2.3 [47]. Additional fastSTRUCTURE analyses were carried out separately for individual eucalypt sections and species to assess resolution at within-taxon levels for provenance differentiation.
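As an illustration of the diversity parameters named above, the following Python sketch computes Ho and He for biallelic SNPs. It is a simplified stand-in for the estimators in hierfstat: it uses the uncorrected expectation 2pq without sample-size correction, and the genotype coding, function names and toy data are assumptions.

```python
from typing import List, Optional, Tuple


def heterozygosity(calls: List[Optional[int]]) -> Tuple[float, float]:
    """Return (Ho, He) for one biallelic SNP.

    Genotypes are coded 0/1/2 (copies of one allele); None marks a missing call.
    He is the uncorrected Hardy-Weinberg expectation 2pq.
    """
    called = [g for g in calls if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("no called genotypes for this SNP")
    ho = sum(1 for g in called if g == 1) / n
    p = sum(called) / (2 * n)
    he = 2 * p * (1 - p)
    return ho, he


def mean_heterozygosity(matrix: List[List[Optional[int]]]) -> Tuple[float, float]:
    """Average Ho and He over all SNPs (rows) of a genotype matrix."""
    pairs = [heterozygosity(calls) for calls in matrix]
    mean_ho = sum(ho for ho, _ in pairs) / len(pairs)
    mean_he = sum(he for _, he in pairs) / len(pairs)
    return mean_ho, mean_he


# Toy data (invented): two SNPs genotyped in four trees.
toy = [
    [0, 1, 1, 2],  # Ho = 0.5, p = 0.5, He = 0.5
    [0, 0, 0, 0],  # monomorphic: Ho = 0, He = 0
]
print(mean_heterozygosity(toy))
```

Averaging per-SNP values within each species or provenance gives summaries of the kind reported per population, although published values would normally use the bias-corrected estimators.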
Genomic composition of the hybrids was initially obtained from the unsupervised inference provided by fastSTRUCTURE, and compared with the recorded pedigree information. Specifically, for fastSTRUCTURE annotation, we used the meanQ file, which provides the probability of each sample belonging to each of the clusters found. Subsequently, a supervised analysis was carried out using ADMIXTURE, a software tool for model-based estimation of ancestry in unrelated individuals [48]. For this analysis, samples defined as being from pure species with 99% probability in the initial fastSTRUCTURE analysis were used as reference populations to infer the genomic composition of the hybrids. A simple matching genetic distance among individual trees was also estimated, and groups were visualized with a principal component analysis (PCA) on the genetic distance matrix, where distances among trees were represented in a Cartesian graph with PC1 and PC2. These analyses were performed in R using packages adegenet [40], ape v5.4 [49] and pegas v0.14 [50]. PCA biplots were visualized using the ggplot2 v3.3.2 package [51].
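The selection of reference individuals for the supervised analysis can be sketched as follows. This is an illustrative Python example, not the authors' script: the in-memory representation of the meanQ rows, the mapping from cluster index to species, and all sample names and values are hypothetical.

```python
from typing import Dict, List


def pure_references(q_rows: Dict[str, List[float]],
                    cluster_species: List[str],
                    threshold: float = 0.99) -> Dict[str, List[str]]:
    """Group samples whose largest membership coefficient reaches the threshold.

    q_rows maps a sample ID to its row of cluster membership coefficients.
    cluster_species[k] is the species label assigned to cluster k
    (an assumption: cluster labels must be assigned by the analyst).
    """
    references: Dict[str, List[str]] = {}
    for sample_id, q in q_rows.items():
        best = max(range(len(q)), key=lambda k: q[k])
        if q[best] >= threshold:
            references.setdefault(cluster_species[best], []).append(sample_id)
    return references


# Toy membership coefficients for a two-cluster model (invented values).
toy_q = {
    "tree_a": [0.995, 0.005],
    "tree_b": [0.600, 0.400],  # admixed, excluded from the references
    "tree_c": [0.001, 0.999],
}
labels = ["E. grandis", "E. urophylla"]
print(pure_references(toy_q, labels))
```

Samples below the threshold, such as the admixed toy tree, are left out of the reference panel and would be among those whose ancestry is then estimated.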

SNP diversity across species
After filtering and LD pruning, the final SNP dataset of 27,298 SNPs (S1 File) had a very low percentage of missing data (<3%) for all germplasm sets (species, provenances and hybrids), corroborating the good performance of the multi-species SNP array for population genomics and molecular breeding across eucalypt taxa. The percentage of polymorphic loci per population ranged from 29.7% for E. deglupta to over 93% for E. urophylla and the hybrids involving crosses between distant sections Latoangulatae and Maidenaria (Table 3). Overall, there was no significant difference in the proportion of polymorphic SNPs among the different eucalypt sections (ANOVA F-value = 1.14, p-value = 0.37). The average MAF was similar across taxa, between 0.10 and 0.15 for most of them, but E. urophylla, E. grandis, E. camaldulensis and the hybrids had a slightly higher average MAF.

Population structure analysis
StructureSelector analysis of the fastSTRUCTURE results indicated the most likely model with K = 18 taxonomic clusters (S2 File). This model correctly assigned each of the 16 species to a different cluster (Fig 2; S3 File). In the case of E. saligna some individuals were additionally separated according to provenance, and the hybrids were assembled in a separate highly admixed cluster (Fig 2). Admixture at the individual level was seen in allegedly pure species trees. Some E. camaldulensis individuals were classified as being admixed with E. tereticornis, some E. urophylla individuals as admixed with E. grandis, and a few additional admixed individuals were seen among trees expected to be pure (Fig 2). At the higher taxonomic level of sections within subgenus Symphyomyrtus, models with smaller numbers of clusters easily separated eucalypt sections. For example, at K = 2, section Maidenaria detached from the rest. With K = 3, Latoangulatae and Maidenaria split, with occasional admixture seen in individuals of some species. With K = 4, species of Exsertaria separated from the other sections. Surprisingly, however, E. longirostrata that belongs to section Exsertaria, was clustered together with E. deglupta and E. argophloia that belong to two other different sections (S4 File).
The SNP dataset could not differentiate most provenances within species when all individuals were analyzed together, except for E. saligna from Kroombit Tops (Fig 2). This provenance was assigned to a separate group from the Helidon and Richmond Range provenances, which in turn were clustered together. The provenances of the other species (E. tereticornis, E. grandis, E. longirostrata, E. pellita, E. pilularis and E. robusta) could not be discriminated even at higher values of K (S5 File). Only when species were analyzed individually did fastSTRUCTURE modeling resolve some of the provenances. This was the case of the two E. grandis provenances from Atherton and Coffs Harbor and E. robusta from Brisbane and Byfield. Somewhat separate clustering was also seen for E. longirostrata from Starkvale, and E. tereticornis from Mount Garnet, although some individuals either displayed admixture or were not clustered accordingly (Fig 3). Lastly, provenances of some species clearly could not be distinguished. This occurred with E. pellita, E. longirostrata from Coominglah and Goodger, and E. tereticornis from Mitchell Road (Oaky Creek) and Mareeba.

Determination of ancestral species composition of hybrids
The ancestral genomic compositions of hybrids estimated with both fastSTRUCTURE and ADMIXTURE were compared to their respective pedigree-expected compositions (Fig 4; S6 File). The supervised analysis carried out using ADMIXTURE resulted, in general, in genomic compositions similar to those obtained with fastSTRUCTURE, although some differences were seen, for example in the Hybrids 1 group, where the fastSTRUCTURE model indicated the unexpected presence of E. tereticornis genome. Overall, there were only nine out of the 44 hybrids for which the SNP-based composition closely matched the pedigree-expected one. This happened for hybrids Hyb-31, Hyb-32, Hyb-33, Hyb-34, Hyb-35, Hyb-36, Hyb-38, Hyb-39, Hyb-40 and Hyb-41, almost all of them simple F1 hybrids. For all other hybrids, small to large deviations were observed.
For a considerable number of hybrids, unanticipated species in addition to those recorded in the pedigree were observed in their composition (Fig 4). For example, hybrids Hyb-1 through Hyb-11 in the Hybrids 1 group were expected to be F1s of E. urophylla and E. camaldulensis. However, eight of them showed variable amounts of E. grandis genome in the ADMIXTURE analysis, while the fastSTRUCTURE model suggested the presence of E. tereticornis genome more frequently than that of E. camaldulensis. The unexpected presence of E. grandis genome was again seen in several other hybrids in the Hybrids 2 group (e.g., Hyb-17, Hyb-18, Hyb-19, Hyb-26). Furthermore, in this group of hybrids none, or a considerably smaller than expected proportion, of the genome was detected as coming from the recorded species of Maidenaria involved in the crosses, namely E. dunnii and E. globulus. E. dunnii genome was not detected in six of the 14 hybrids and E. globulus in seven of 12 where it should have been observed (Fig 4). For example, in hybrids Hyb-13, Hyb-17, Hyb-18, Hyb-19, Hyb-21 and Hyb-26, expected to be F1 hybrids of Latoangulatae species (E. urophylla, E. grandis or E. saligna) with Maidenaria species (E. dunnii or E. globulus), the SNP data showed little or no sign of the two temperate species' genomes and an unexpected or larger than expected proportion of the genome of E. grandis. Finally, there were cases where the presumed genomic composition was completely different from the SNP-estimated one. For example, hybrid Hyb-42 was expected to be an E. dunnii x E. globulus hybrid, when in fact it involved mainly species of Latoangulatae with E. camaldulensis, suggesting mislabeling.
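The comparison between pedigree-expected and SNP-estimated compositions can be made explicit with a small Python sketch. The study does not state a numeric criterion for a "close match", so the 0.10 tolerance used here is an assumption, as are the function name and the invented proportions in the toy hybrid.

```python
from typing import Dict, List, Union

Report = Dict[str, Union[Dict[str, float], List[str], bool]]


def compare_composition(expected: Dict[str, float],
                        realized: Dict[str, float],
                        tolerance: float = 0.10) -> Report:
    """Compare expected and realized ancestry proportions species by species.

    tolerance is a hypothetical cut-off: a species counts as present when its
    realized proportion exceeds it, and a hybrid matches its pedigree when no
    species deviates by more than it.
    """
    species = sorted(set(expected) | set(realized))
    deviations = {s: realized.get(s, 0.0) - expected.get(s, 0.0) for s in species}
    unexpected = [s for s in species
                  if expected.get(s, 0.0) == 0.0 and realized.get(s, 0.0) > tolerance]
    undetected = [s for s in species
                  if expected.get(s, 0.0) > 0.0 and realized.get(s, 0.0) <= tolerance]
    matches = all(abs(d) <= tolerance for d in deviations.values())
    return {"deviations": deviations, "unexpected": unexpected,
            "undetected": undetected, "matches": matches}


# Toy hybrid (invented proportions): recorded as an F1 of E. urophylla x E. globulus.
expected = {"E. urophylla": 0.5, "E. globulus": 0.5}
realized = {"E. urophylla": 0.55, "E. grandis": 0.40, "E. globulus": 0.05}
report = compare_composition(expected, realized)
print(report["unexpected"], report["undetected"], report["matches"])
```

In this toy case E. grandis is flagged as an unexpected contributor and E. globulus as essentially undetected, the same two kinds of deviation described for several hybrids in the Hybrids 2 group.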

Genetic distances among species, provenances and hybrids
Overall, the PCA plot based on the genetic distance matrix positioned the different species and sections as expected, clustering phylogenetically closer species of the same section (Fig 5). A clear exception, however, was seen for E. longirostrata, taxonomically classified in section Latoangulatae. The PCA placed it away from Latoangulatae and closer to E. argophloia and E. deglupta. These two species belong to two different sections but they clustered together, considerably separated from all other species. In most cases, the PCA did not have the resolution to discriminate provenances within species. In line with the fastSTRUCTURE results, the exceptions were the Kroombit Tops provenance of E. saligna, and the two provenances each of E. robusta and E. grandis, which were separated in the PCA.

Genome-wide eucalypt species SNPs diversity
Consistent with the initial validation data provided alongside the EUChip60K development [31], our results corroborate that the current SNP array platforms offer effective power to carry out genetic diversity analysis of the main eucalypt species planted worldwide. Within species, the proportion of polymorphic SNPs showed some variation, although for the vast majority, over 40% of the SNPs were informative and the average MAF was generally above 0.13, despite the relatively limited sample sizes analyzed (Table 3). Higher proportions of polymorphic SNPs, from 68% up to 93%, and higher average MAF were observed for E. grandis, E. camaldulensis and E. urophylla. These results may be explained in part by the somewhat larger sample sizes analyzed. In the case of E. urophylla the alleged mixture of provenances might have contributed to the higher diversity. A second explanation for the higher SNP diversity in these three species involves potential admixture due to unintended interspecific hybridization. These three species are widely used to generate interspecific hybrids in Brazil, and the structure analysis results indicated admixture in the E. urophylla and E. camaldulensis trees (see below).
A third possible explanation for the higher SNP diversity observed in E. grandis, E. camaldulensis and E. urophylla is some ascertainment bias derived from the panels used in the initial SNP discovery for the development of the EUChip60K. Although SNP discovery was carried out on sequence data for 240 trees of 12 species, a proportionally larger amount of sequence data was obtained for these three species when compared to the others [31]. Large proportions of informative SNPs (58-60%) were also seen for species of Maidenaria, consistent with the fact that E. globulus was also an important target of sequence production during SNP discovery. Larger proportions of polymorphic SNPs and higher average MAF were also observed in the different hybrids. This was expected, given the transmission to the hybrid of alternative SNP alleles fixed in each parental species.
Except for E. urophylla, E. grandis, E. camaldulensis and the hybrids, the results suggest that the rate of SNP polymorphism might depend more on the level of genetic diversity captured in the specific sample of individuals than on the particular species analyzed. This in turn indicates that the SNP set used delivers largely equivalent numbers of polymorphic SNPs between any pair of taxa within the main sections of subgenus Symphyomyrtus. This indicates good potential for the selection of ancestry-informative SNP sets [52], that is, SNPs that appear at substantially different frequencies between species, provenances or populations in this phylogenetic group. The expansion of the number of species and provenances and the specific selection of ancestry-informative SNPs at the species and provenance levels would constitute an obvious follow-up of this study.

SNPs recover the expected species structure but admixture is present
Genome-wide SNP data provided the necessary resolution to check and validate the phylogenetic classification of germplasm sources of the eucalypt species sampled in this study. The most likely model for the SNP dataset found K = 18 clusters, allowing clearcut discrimination of the five sections and the 16 species sampled of subgenus Symphyomyrtus, while reliably indicating the admixed composition of hybrids (Fig 2). This result substantiates what a number of previous phylogenetic studies have shown using different types of DNA marker data such as ribosomal ITS, chloroplast DNA, microsatellites and DArT (reviewed in [2]), and more recent studies that further expanded the sampling of taxa and individuals within taxa [8]. The evolutionary history 'written' in the genome of these Symphyomyrtus species is generally consistent with their current phylogenetic organization within this subgenus.
Unlike several previous reports that examined germplasm sampled exclusively in its center of origin, our study included material conserved in exotic conditions from variable sources (Table 1). In general, for the species' germplasm that came directly from original sources in Australia, the genetic structure splits were clearcut. For species that included material from unknown origins or collected in germplasm banks established in Brazil, occasional admixture was seen. The E. camaldulensis trees showed admixture with E. tereticornis, E. viminalis individuals showed admixture with E. dunnii, and E. urophylla sampled from multiple provenances established in Brazil displayed significant admixture with E. grandis (Fig 2). For some of these germplasm sources our data indicate that accidental hybridization might have taken place once the germplasm was introduced in Brazil. In the exotic habitat under different ecoclimatic conditions, reproductive barriers such as geographic distance and flowering phenology, which keep eucalypt species apart in their natural range, are relaxed or even broken, facilitating hybridization [53]. The paradigmatic example is the famous eucalypt hybrid swarm of the Rio Claro Arboretum established upon the introduction of Eucalyptus species in Brazil in 1904 [53,54]. Several species were planted side by side, and seeds collected from that germplasm generated very heterogeneous plantation forests, where some hybrids of unknown origin and outstanding performance were selected and are still planted or used in breeding programs today [17,18]. The results of our study point to the development of ancestry-informative SNPs that should allow reconstructing and understanding the recombination history of these hybrids.
The E. viminalis germplasm sample also showed evidence of admixture with E. dunnii and E. globulus at K = 18. This sample of trees was from an advanced generation germplasm source established in Brazil but with unknown origin in Australia. Hybridization between these temperate species of section Maidenaria once introduced in Brazil cannot be ruled out, although it is less likely than for the previously mentioned species of Exsertaria and Latoangulatae, since Maidenaria species flower and produce seed less conspicuously in the tropics [53]. When a model with K = 20 was tested, E. viminalis individuals clearly split (S5 File) with no evidence of hybrid constitution. This result highlights the long-standing challenge with admixture modeling, whereby selecting the most likely number of clusters K is a difficult problem to automate in a robust way [39].
The graphical projection of the different species and hybrids in the PCA was generally consistent with the phylogenetic expectations (Fig 5). Complementing the structure analysis, the PCA provided additional information regarding the genetic distance among the different taxa. E. deglupta and E. argophloia were placed at a considerable distance from the main sections of Symphyomyrtus. The fact that they clustered together was however unexpected, since they are classified in distinct sections. These two species are currently part of Symphyomyrtus [55] and, while no contention exists regarding the classification of E. argophloia, E. deglupta was originally classified in subgenus Minutifructis [4]. The three main sections of interest in the subgenus were clearly separated and contained the expected species, with the exception of E. longirostrata, which clustered away from its section Latoangulatae and distant from Exsertaria as well.
Samples of E. longirostrata have been examined in the most extended molecular phylogenetic study of terminal taxa of sections Maidenaria, Exsertaria and Latoangulatae to date [8]. That study produced a phylogeny that largely matched the morphological treatment of sections, although sections Exsertaria and Latoangulatae were shown to be polyphyletic. Several inconsistencies between the morphological classification and the molecular phylogeny were described, and a number of taxa in Latoangulatae were deemed polyphyletic at the species level. A polyphyletic group is one that shows mixed evolutionary origin, descended from more than one ancestor, with taxa sharing homoplasies, typically explained as a result of convergent evolution, complicating the correct taxonomical classification [56]. E. longirostrata was itself deemed polyphyletic, classified within series Lepidotae-Fimbriatae and clustered into Latoangulatae IV, a clade considerably distant from Latoangulatae II where E. grandis, E. pellita, E. robusta and the section type species E. saligna belong. Furthermore, those authors suggest that all Latoangulatae species other than those in Latoangulatae II would be better placed in other taxonomic sections to reflect the phylogeny revealed in their study. The most recent classification of the eucalypts [14,55], however, classified E. longirostrata into a different section, Pumilio. In our study, the sharp split of E. longirostrata from Latoangulatae and Exsertaria (Fig 5) provides further molecular evidence for this most recent taxonomic classification placing the species in a separate section.

Provenance discrimination is strongly dependent on geographical distance
With the exception of one provenance of E. saligna, no Eucalyptus provenances could be discriminated when all 484 samples were analyzed together (Fig 2). When species were analyzed separately, provenances could be discriminated for some species but not for others (Figs 3,5). Looking at the geographical position of the sampled provenances (Fig 1), a pattern emerged suggesting that SNP-based discrimination was strongly dependent on geographical distance. The two provenances of E. grandis (Atherton and Coffs Harbor), separated in the structure and PCA analyses, are located more than 2,000 km apart. The same happened with provenances Byfield and Brisbane of E. robusta, ~700 km from each other, and the E. saligna Kroombit Tops provenance, located >700 km from the other two E. saligna provenances. All other provenances that were loosely or otherwise not discriminated are located less than 200-300 km apart. These results indicate an isolation-by-distance (IBD) model of population structure for the provenances sampled for these species. Under IBD, the genetic similarity between populations is expected to decrease exponentially as the geographic distance between them increases, because of the limiting effect of geographic distance on rates of gene flow [57].
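The relationship described above, genetic differentiation increasing with geographic distance, can be explored with a simple Python sketch that computes great-circle distances and correlates them with pairwise genetic distances. No such test was reported in this study; the function names and all toy values below are invented for illustration. Because pairwise distances are not independent, a permutation-based approach such as a Mantel test would be needed to assess significance.

```python
import math
from typing import List


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy provenance pairs (invented): geographic distance in km and a
# pairwise genetic distance such as FST.
geo_km = [150.0, 250.0, 700.0, 2000.0]
gen_dist = [0.010, 0.012, 0.050, 0.110]
r = pearson(geo_km, gen_dist)
print(round(r, 2))
```

A strongly positive correlation in real data would be consistent with IBD, whereas provenance pairs only a few hundred kilometers apart would be expected to show small genetic distances and little discrimination.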
A number of studies in Eucalyptus have looked at the prevalence of genetic structure between populations located at various geographic distances. These studies have generally shown that an IBD model fits the observed data well, with genetic distances between provenances strongly positively correlated with geographic distances [24,58,59]. A recent landscape study based on very dense DNA data obtained by whole genome sequencing in E. albens and E. sideroxylon also found strong support for IBD in both species [60]. Taken together, our results and those of others indicate that clearcut distinction of Eucalyptus germplasm sources with regard to provenance variation might not be straightforward even with a dense panel of SNPs, unless provenances are geographically distant or provenance-informative SNP markers are specifically identified and used. As a result, what breeders call different provenances could in effect be members of the same continuous population despite several kilometers of physical distance, if gene flow is ubiquitous. It must be mentioned, however, that our study suffered from limited and somewhat uneven sampling of provenances, which might have contributed to a greater difficulty in distinguishing some of them. It has been shown that subpopulations with reduced sampling tend to be merged together in genetic structure analyses, and uneven sampling may lead to downward-biased estimates of the true number of subpopulations [44]. Larger sample sizes for the provenances studied should allow better estimation of allele frequencies and possibly selection of ancestry-informative, provenance-specific SNPs for greater discrimination power.

Genomic composition of hybrids indicates directional selection toward tropical genomes
Our genome-wide data showed that the majority of the hybrids studied (35 out of 44) displayed genomic composition deviating from the expected one based on pedigree information (Fig 4). This result is important in view of the long-standing and widespread adoption of deliberate breeding strategies toward the selection of elite hybrid clones with specific anticipated genomic composition, especially in tropical countries (reviewed in [12]). This in turn highlights one more important application of dense, high-quality array-based SNP data in support of breeding programs. SNP data not only provide precise germplasm identity verification, but more importantly allow the breeder to objectively recognize the actual ancestral origin of superior hybrids in order to discard unwanted hybrid combinations or to more realistically guide the breeding program toward the development of the desired genetic material.
For the sample of hybrids studied in this work, the lack of agreement between the expected genomic composition and the actual one suggests at least two hypotheses. The first is mislabeling during controlled crosses, as was likely the case for hybrids Hyb-13, Hyb-14 and Hyb-42. The second, and most probable, hypothesis is pervasive genetic admixture of the parents involved in the original interspecific cross. Given the frequently unknown introduction history, followed by local intermating in Brazil in the last 120 years, as discussed previously, there is a considerable possibility that the presumed parents were themselves misclassified. Moreover, because hybrids tend to be produced by crossing good performing parents in the breeding program, it is quite possible that some of the parents used were themselves hybrids, distorting the expected composition of the resulting hybrid offspring. Species within the same sections of Symphyomyrtus that display overlapping morphological features and easily hybridize would be more prone to such occurrences. Clearcut examples were six supposedly F1 hybrids that in principle did not involve E. grandis, but where the SNP data revealed its presence (Hyb-13, Hyb-17, Hyb-18, Hyb-19, Hyb-26, Hyb-42). Likewise, several F1 hybrids of E. urophylla with E. camaldulensis (Hybrids 1 group) showed variable amounts of E. grandis genome in their composition, and the presence of E. tereticornis genome more frequently than that of E. camaldulensis (Fig 4). Admixture of E. grandis genome into the E. urophylla parents and difficulties in morphologically discriminating E. camaldulensis germplasm from E. tereticornis could readily explain these results.
Besides the presence of E. grandis as an unexpected species in the genomically realized pedigree, larger than expected proportions of E. grandis genome were also observed in all hybrids where this species was involved. Fourteen hybrids derived from advanced generation recombinant intercrosses involved one or both hybrid parents with three or more species represented, E. grandis being one of them (Hyb-14, Hyb-15, Hyb-16, Hyb-20, Hyb-22 through Hyb-25, Hyb-27 through Hyb-29, Hyb-32, Hyb-43 and Hyb-44) (Table 2). The pedigree-expected proportions were estimated based on the final presumed participation of each single species in the pedigree, assuming balanced Mendelian inheritance and recombination rates in the previous hybrid generations with no selection. For all of these hybrids, however, the SNP data showed a consistently higher proportion of E. grandis genome in the hybrid composition. Aside from unintended admixture in the original parents, the ubiquitous unexpected presence or higher than anticipated proportion of E. grandis genome in the vast majority of hybrids strongly suggests genome-wide directional selection for this species' genome throughout the breeding history of these complex hybrid clones. This should not be surprising given that volume growth is the main breeding target, and that E. grandis is well known for its fast growth [53]. Our data therefore not only corroborate the pivotal role of E. grandis in hybrid breeding, but also show that its actual participation is considerably larger than expected and frequently unintended. Moreover, our data also demonstrate that in hybrids between species of Latoangulatae and Exsertaria with species of Maidenaria (Hybrids 2 group), the actual participation of the latter, such as E. globulus, E. dunnii and E. benthamii, in the final hybrid's genome composition is less than expected, consistent with strong selection against the less adapted temperate genomes in tropical environments.
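As an illustration of how pedigree-expected proportions of this kind can be derived, the following minimal sketch assumes that each parent contributes exactly half of its own expected composition to its offspring, with no selection. It is not the procedure used in this study: the function names, the nested-tuple pedigree encoding, the example pedigree and the "observed" proportions are all hypothetical and serve only to show the arithmetic.

```python
from fractions import Fraction


def expected_ancestry(node):
    """Return the pedigree-expected species proportions for a tree.

    `node` is either a species name (a pure-species parent) or a
    (mother, father) tuple whose members are themselves nodes.
    Assumption: each parent passes on half of its expected composition.
    """
    if isinstance(node, str):
        return {node: Fraction(1)}
    mother, father = node
    proportions = {}
    for parent in (mother, father):
        for species, share in expected_ancestry(parent).items():
            proportions[species] = proportions.get(species, Fraction(0)) + share / 2
    return proportions


def deviation(expected, observed):
    """Observed minus expected proportion for every species in either set."""
    species = sorted(set(expected) | set(observed))
    return {s: float(observed.get(s, 0.0)) - float(expected.get(s, 0)) for s in species}


# Hypothetical three-species pedigree: (E. urophylla x E. grandis) x E. globulus
pedigree = (("E. urophylla", "E. grandis"), "E. globulus")
expected = expected_ancestry(pedigree)

# Made-up SNP-based ancestry estimates, for illustration only
observed = {"E. grandis": 0.45, "E. urophylla": 0.25, "E. globulus": 0.30}

for species, delta in deviation(expected, observed).items():
    print(f"{species}: expected {float(expected.get(species, 0)):.2f}, "
          f"observed {observed.get(species, 0.0):.2f}, deviation {delta:+.2f}")
```

For this hypothetical pedigree the expected composition is one quarter E. urophylla, one quarter E. grandis and one half E. globulus. A positive deviation for E. grandis together with a negative one for E. globulus would correspond to the pattern described above.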

Concluding remarks
In conclusion, we have shown that the current Eucalyptus multi-species SNP array platform provides a valuable tool to examine within-taxa variation in Symphyomyrtus, to investigate population structure and to track the genomic ancestry of individual clones. As the current "gold standard" in the high-throughput SNP genotyping industry, SNP arrays provide full data portability across studies carried out at different times. This represents a crucial advantage for the construction of legacy SNP databases for multiple Eucalyptus species and populations when compared to reduced representation genotyping-by-sequencing methods. SNP array data portability allows effortless data consolidation across time for comparative studies and meta-analyses, which should be valuable for resolving taxonomic issues that still persist in the eucalypts. We are aware, however, that for eucalypt species phylogenetically distant from subgenus Symphyomyrtus, the current SNP array will not provide equivalent numbers of informative SNPs due to higher genomic divergence [31].
We have also shown that while species classification is well resolved at the genome-wide level, provenance discrimination is not always so. It depends essentially on geographical distance, consistent with an isolation-by-distance model, and is likely to be affected by sample size. Further studies with larger sample sizes and the identification of provenance-specific SNPs are warranted. Finally, our results are novel in that they objectively show, based on SNP data, that unplanned genetic admixture should not be a surprise in exotic germplasm sources, not only in Brazil but likely in other countries as well, especially among phylogenetically closer species that easily hybridize in exotic environments. Moreover, the genomic ancestral composition of control-crossed hybrids in Brazil indicated that strong selection takes place in favor of tropical genomes, and more specifically that of E. grandis. SNP-based auditing of hybrids' genomic composition could be introduced as a standard practice in hybrid breeding programs to more accurately guide the program toward the development of the desired genetic material.